import warnings
# Ignore future warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
# First, let's load the dataset and examine its structure to proceed with the analysis.
import calendar
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
# Load the dataset
file_path = '/content/Final Data.csv'
data = pd.read_csv(file_path)
# Display the first few rows of the dataset to understand its structure and variables
data.head()
| Date | Cabbage | Capsicum | Cauliflower | Chicken | Cucumber | eddoe | Egg | Egg_Plant | Garlic | ... | Pakistani_Banana | Potato | Pumpkin | Tomato | Zucchini | Economic_Policy_Uncertainty_Index | Temperature | Rain | Petrol_Prices | USD_PKR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 01/01/2017 | 30 | 64 | 26 | 112 | 48 | 60 | 115 | 28 | 296 | ... | 75 | 27 | 44 | 26 | 27 | 45.36 | 20 | 0 | 66.51 | 104.375 |
| 1 | 02/01/2017 | 36 | 67 | 23 | 112 | 44 | 70 | 116 | 40 | 330 | ... | 80 | 26 | 50 | 28 | 26 | 45.36 | 19 | 0 | 66.51 | 104.750 |
| 2 | 03/01/2017 | 35 | 65 | 24 | 112 | 41 | 74 | 118 | 40 | 330 | ... | 88 | 27 | 60 | 30 | 27 | 45.36 | 17 | 1 | 66.51 | 104.785 |
| 3 | 04/01/2017 | 35 | 68 | 25 | 115 | 44 | 70 | 118 | 52 | 330 | ... | 90 | 28 | 65 | 32 | 25 | 45.36 | 15 | 1 | 66.51 | 104.810 |
| 4 | 05/01/2017 | 33 | 66 | 23 | 97 | 42 | 72 | 121 | 40 | 335 | ... | 85 | 27 | 50 | 29 | 25 | 45.36 | 17 | 0 | 66.51 | 104.700 |
5 rows × 24 columns
The dataset includes a variety of variables, such as prices for different vegetables (e.g., Cabbage, Capsicum, Cauliflower), Chicken, economic indicators (Economic Policy Uncertainty Index, Petrol Prices, USD to PKR exchange rate), and environmental factors (Temperature, Rain). "Potato" is identified as the dependent variable, which means the goal is to predict its prices based on other features.
# Clean the data by converting non-numeric entries to NaN
commodities = [
    'Cabbage', 'Capsicum', 'Cauliflower', 'Chicken', 'Cucumber', 'eddoe',
    'Egg', 'Egg_Plant', 'Garlic', 'Ginger', 'Green Chilli', 'Laddy_Finger',
    'Onion', 'Pakistani_Banana', 'Potato', 'Pumpkin', 'Tomato', 'Zucchini'
]
for commodity in commodities:
    data[commodity] = pd.to_numeric(data[commodity], errors='coerce')
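The `errors='coerce'` behaviour is easy to verify on a toy column. A minimal sketch (the values here are hypothetical, not taken from Final Data.csv):

```python
import pandas as pd

# A price column contaminated with a stray text entry
raw = pd.Series(['30', '36', 'N/A', '35'])
clean = pd.to_numeric(raw, errors='coerce')  # 'N/A' becomes NaN instead of raising
```

Any cell that cannot be parsed as a number becomes `NaN`, which is exactly what allows the later missing-value handling to find the bad entries.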
# Continue with our previous processing
data['Date'] = pd.to_datetime(data['Date'], format='%d/%m/%Y')  # dates are day-first
data['Month'] = data['Date'].dt.month_name()
data['Year'] = data['Date'].dt.year
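Passing an explicit `format` matters here: the file stores day-first dates, and without it pandas silently parses ambiguous strings like `02/01/2017` as February 1st. A small check with hypothetical strings in the same layout:

```python
import pandas as pd

# DD/MM/YYYY strings, as in the dataset's Date column
dates = pd.to_datetime(pd.Series(['01/01/2017', '02/01/2017']), format='%d/%m/%Y')
```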
# Now, we will create a table that contains the average price for each commodity by month and year
table = pd.DataFrame()
# For each commodity, we calculate the average price by month
for commodity in commodities:
    # Group by year and month, and calculate the average
    monthly_avg = data.groupby(['Year', 'Month'])[commodity].mean().reset_index()
    # Merge into the table DataFrame
    if table.empty:
        table = monthly_avg
    else:
        table = table.merge(monthly_avg, on=['Year', 'Month'], how='outer')
# Pivot the table to have months as rows and years as columns for each commodity
pivot_table = table.pivot(index='Month', columns='Year')
# Sort the pivot_table index based on calendar month order
months_order = ['January', 'February', 'March', 'April', 'May', 'June',
'July', 'August', 'September', 'October', 'November', 'December']
pivot_table = pivot_table.reindex(months_order)
print(pivot_table)  # Display the pivoted monthly averages
pivot_table.to_excel('commodity_prices.xlsx')
                Cabbage
Year            2017       2018       2019       2020       2021        2022
Month
January    30.580645  34.683333  35.901639  23.419355  28.032258   58.161290
February   28.285714  35.555556  27.481481  24.724138  27.178571   76.714286
March      27.064516  32.766667  28.200000  22.806452  24.225806   65.903226
April      25.966667  39.181818  48.413793  24.266667  32.600000   68.200000
May        23.903226  33.666667  53.700000  28.354839  29.774194   77.387097
June       29.733333  49.310345  51.827586  38.033333  38.366667  125.033333
July       24.129032  55.933333  44.000000  45.516129  51.580645   93.967742
August     39.161290  63.066667  37.466667  42.354839  56.806452   85.032258
September  34.000000  45.689655  45.344828  35.300000  48.266667   74.700000
October    39.580645  42.000000  41.433333  33.806452  43.677419   99.677419
November   40.933333  34.827586  34.137931  37.933333  36.766667   82.666667
December   41.096774  30.300000  28.933333  38.516129  37.064516   56.516129
... (output truncated: analogous year-by-month columns for the remaining 17 commodities omitted)

[12 rows x 108 columns]
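The pivot-then-reindex pattern used above (months as rows, years as columns, then reordered into calendar order rather than the default alphabetical sort) can be sketched on a tiny hypothetical table:

```python
import pandas as pd

toy = pd.DataFrame({
    'Month': ['February', 'January', 'January', 'February'],
    'Year':  [2017, 2017, 2018, 2018],
    'Price': [28.0, 30.0, 34.0, 35.0],
})
# Pivot: months become the index, years become the columns
pivoted = toy.pivot(index='Month', columns='Year', values='Price')
# Reindex into calendar order; the default index order would be alphabetical
pivoted = pivoted.reindex(['January', 'February'])
```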
# Seasonal Trends in Potato Prices
plt.figure(figsize=(12, 6))
data['Month'] = pd.to_datetime(data['Date']).dt.month_name()
sns.lineplot(x='Month', y='Potato', data=data, estimator='mean')
plt.xticks(rotation=45)
plt.title('Seasonal Trends in Potato Prices')
plt.show()
# Our DataFrame 'data' holds 'Potato' prices along with 'Temperature' and 'Rain'
plt.figure(figsize=(12, 6))
# Note: the two weather variables are on different scales, so overlaying them
# on one x-axis gives only a rough visual comparison
sns.scatterplot(x=data['Temperature'], y=data['Potato'], color='blue', label='Temperature')
sns.scatterplot(x=data['Rain'], y=data['Potato'], color='red', label='Rain')
plt.title('Impact of Weather on Potato Prices')
plt.xlabel('Weather Variables')
plt.ylabel('Potato Price')
plt.legend()
plt.show()
# Inter-Commodity Price Relationship
# This would typically be a pairplot, but for simplicity, we'll do a correlation heatmap
commodities = ['Potato', 'Onion', 'Chicken', 'Egg']  # key staples for the comparison
plt.figure(figsize=(10, 8))
sns.heatmap(data[commodities].corr(), annot=True, fmt=".2f", cmap='coolwarm')
plt.title('Inter-Commodity Price Relationship')
plt.show()
data['Date'] = pd.to_datetime(data['Date'])
data['Month'] = data['Date'].dt.month
monthly_avg = data.groupby('Month')['Potato'].mean()
plt.figure(figsize=(10, 5))
monthly_avg.plot(kind='bar', color='skyblue')
plt.title('Average Monthly Potato Prices')
plt.xlabel('Month')
plt.ylabel('Average Price')
plt.xticks(ticks=range(0, 12), labels=[calendar.month_abbr[i+1] for i in range(0, 12)], rotation=45)
plt.show()
# Ensure 'Date' is a datetime object
data['Date'] = pd.to_datetime(data['Date'])
data['Year'] = data['Date'].dt.year
data['Month'] = data['Date'].dt.month
# Calculate monthly average
monthly_avg_prices = data.groupby(['Year', 'Month'])['Potato'].mean().reset_index()
# Find the min and max for each year
min_prices = monthly_avg_prices.loc[monthly_avg_prices.groupby('Year')['Potato'].idxmin()]
max_prices = monthly_avg_prices.loc[monthly_avg_prices.groupby('Year')['Potato'].idxmax()]
# Prepare the data for plotting
plot_data = pd.merge(min_prices, max_prices, on='Year', suffixes=['_min', '_max'])
# Create figure and axis objects
fig, ax = plt.subplots(figsize=(15, 7))
# Plot min prices
ax.bar(plot_data['Year'] - 0.2, plot_data['Potato_min'], color='skyblue', width=0.4, label='Min Avg Price', align='center')
# Plot max prices
ax.bar(plot_data['Year'] + 0.2, plot_data['Potato_max'], color='salmon', width=0.4, label='Max Avg Price', align='center')
# Set the labels and titles
ax.set_xlabel('Year', fontsize=14)
ax.set_ylabel('Average Monthly Potato Price', fontsize=14)
ax.set_title('Min and Max Average Monthly Potato Prices by Year', fontsize=16)
ax.set_xticks(plot_data['Year'])
ax.legend()
# Display the plot
plt.show()
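The min/max lookup above relies on `groupby(...).idxmin()` / `idxmax()`, which return the row *labels* of the extreme values per group so that `.loc` can pull back the whole rows (including the month). A minimal sketch with hypothetical prices:

```python
import pandas as pd

monthly = pd.DataFrame({
    'Year':   [2017, 2017, 2018, 2018],
    'Month':  [1, 2, 1, 2],
    'Potato': [27.0, 25.0, 30.0, 33.0],
})
# idxmin/idxmax give the index labels of each year's extreme month
mins = monthly.loc[monthly.groupby('Year')['Potato'].idxmin()]
maxs = monthly.loc[monthly.groupby('Year')['Potato'].idxmax()]
```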
# Make sure 'Date' is a datetime column
data['Date'] = pd.to_datetime(data['Date'])
# Extract month and year from Date
data['Month'] = data['Date'].dt.month
data['Year'] = data['Date'].dt.year
# Pivot the data to have years as columns and months as rows
monthly_prices = data.pivot_table(index='Month', columns='Year', values='Potato')
# Create a new figure with a specific size
plt.figure(figsize=(15, 7))
# Plot each year's data on the same axis
for year in monthly_prices.columns:
    plt.plot(calendar.month_abbr[1:], monthly_prices[year], label=year)
# Add title and labels
plt.title('Monthly Potato Prices Over Years')
plt.xlabel('Month')
plt.ylabel('Potato Price')
# Add legend
plt.legend(title='Year')
# Show the plot
plt.show()
# Convert 'Date' to datetime and extract the month
data['Date'] = pd.to_datetime(data['Date'])
data['Month'] = data['Date'].dt.month_name()
# List of commodities to plot. This should match the column names of our dataset exactly.
commodities = [
'Cabbage', 'Capsicum', 'Cauliflower', 'Chicken', 'Cucumber', 'eddoe',
'Egg', 'Egg_Plant', 'Garlic', 'Ginger', 'Green Chilli', 'Laddy_Finger',
'Onion', 'Pakistani_Banana', 'Potato', 'Pumpkin', 'Tomato', 'Zucchini'
]
# Plot seasonal trends for each commodity in separate graphs
for commodity in commodities:
    plt.figure(figsize=(12, 6))
    sns.lineplot(x='Month', y=commodity, data=data, estimator='mean')
    plt.xticks(rotation=45)
    plt.title(f'Seasonal Trends in {commodity} Prices')
    plt.xlabel('Month')
    plt.ylabel('Average Price')
    plt.show()
data['Year'] = data['Date'].dt.year
yearly_avg = data.groupby('Year')['Potato'].mean()
plt.figure(figsize=(10, 5))
yearly_avg.plot(kind='line', marker='o', color='green')
plt.title('Average Yearly Potato Prices')
plt.xlabel('Year')
plt.ylabel('Average Price')
plt.show()
# Pivot to months as rows and years as columns so each year plots as one line;
# pivoting on a Year-Month index would leave every column mostly NaN
monthly_prices = data.pivot_table(index=data['Date'].dt.month, columns=data['Date'].dt.year, values='Potato')
plt.figure(figsize=(15, 7))
monthly_prices.plot(ax=plt.gca())
plt.title('Monthly Potato Prices Over Years')
plt.xlabel('Month')
plt.ylabel('Potato Price')
plt.legend(title='Year')
plt.show()
plt.figure(figsize=(12, 6))
sns.boxplot(data=data, x='Month', y='Potato', order=list(calendar.month_name)[1:], palette='coolwarm')
plt.title('Monthly Potato Price Distribution')
plt.xlabel('Month')
plt.ylabel('Potato Prices')
plt.show()
# Use a temporary Date index so 'data' keeps its 'Date' column for the cells below
data.set_index('Date')['Potato'].rolling(window=30).mean().plot(figsize=(12, 6))  # 30-day moving average
plt.title('30-Day Moving Average of Potato Prices')
plt.xlabel('Date')
plt.ylabel('30-Day Moving Average Price')
plt.show()
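A rolling mean with `window=30` only produces a value once a full window of observations is available; the first 29 points are `NaN`. The mechanics on a tiny hypothetical series with `window=2`:

```python
import pandas as pd

prices = pd.Series([10.0, 20.0, 30.0, 40.0])
ma = prices.rolling(window=2).mean()  # first entry is NaN: the window is not yet full
```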
data['Date'] = pd.to_datetime(data['Date'])
plt.figure(figsize=(10, 6))
plt.plot(data['Date'], data['Potato'], label='Potato Prices')
plt.xlabel('Date')
plt.ylabel('Price')
plt.title('Potato Prices Over Time')
plt.legend()
plt.show()
plt.figure(figsize=(8, 5))
sns.boxplot(data['Potato'])
plt.title('Distribution of Potato Prices')
plt.show()
corr = data.select_dtypes(include='number').corr()  # keep only numeric columns; 'Date' and month names are excluded
plt.figure(figsize=(12, 10))
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix of Variables')
plt.show()
sns.pairplot(data[['Potato', 'Chicken', 'Rain', 'Petrol_Prices', 'USD_PKR']])
plt.show()
plt.figure(figsize=(8, 5))
data['Economic_Policy_Uncertainty_Index'].hist(bins=30)
plt.title('Distribution of Economic Policy Uncertainty Index')
plt.xlabel('Economic Policy Uncertainty Index')
plt.ylabel('Frequency')
plt.show()
plt.figure(figsize=(8, 5))
plt.scatter(data['Temperature'], data['Potato'])
plt.xlabel('Temperature')
plt.ylabel('Potato Prices')
plt.title('Potato Prices vs. Temperature')
plt.show()
veggie_prices = data[['Cabbage', 'Capsicum', 'Cauliflower', 'Cucumber', 'Egg_Plant', 'Tomato']].mean()
veggie_prices.plot(kind='bar')
plt.title('Average Prices of Vegetables')
plt.ylabel('Average Price')
plt.xticks(rotation=45)
plt.show()
plt.figure(figsize=(10, 6))
plt.plot(data['Date'], data['Petrol_Prices'], label='Petrol Prices', color='red')
plt.xlabel('Date')
plt.ylabel('Price')
plt.title('Petrol Prices Over Time')
plt.legend()
plt.show()
plt.figure(figsize=(8, 5))
sns.violinplot(data=data, y='Rain')
plt.title('Distribution of Rainfall')
plt.show()
plt.figure(figsize=(10, 6))
plt.plot(data['Date'], data['Cabbage'], label='Cabbage')
plt.plot(data['Date'], data['Tomato'], label='Tomato')
plt.plot(data['Date'], data['Capsicum'], label='Capsicum')
plt.xlabel('Date')
plt.ylabel('Price')
plt.title('Prices of Various Vegetables Over Time')
plt.legend()
plt.show()
# Convert all columns to numeric; 'Date' and the derived calendar columns are excluded from the model
data_new = data.drop(columns=['Date', 'Month', 'Year', 'YearMonth'], errors='ignore')
data_new = data_new.apply(pd.to_numeric, errors='coerce') # Convert all columns to numeric, errors='coerce' will set invalid parsing to NaN
# Now, let's check for any NaN values in the dataset and decide how to handle them
nan_summary = data_new.isna().sum()
# Display the summary of NaN values across all columns
nan_summary
Date                                 0
Cabbage                              0
Capsicum                             0
Cauliflower                          0
Chicken                              0
Cucumber                             0
eddoe                                0
Egg                                  0
Egg_Plant                            0
Garlic                               0
Ginger                               0
Green Chilli                         0
Laddy_Finger                         0
Onion                                0
Pakistani_Banana                     0
Potato                               0
Pumpkin                              0
Tomato                               0
Zucchini                             0
Economic_Policy_Uncertainty_Index    0
Temperature                          0
Rain                                 0
Petrol_Prices                        0
USD_PKR                              0
dtype: int64
The converted dataset contains missing values in several columns, including "Capsicum," "Garlic," "Green Chilli," "Laddy_Finger," "Pakistani_Banana," "Potato," "Tomato," and "Zucchini" (the zeros above come from the raw strings, which mask the unparseable entries until `to_numeric` coerces them to NaN). To proceed with a clean dataset, these gaps must be handled. Filling with the column mean is a common choice for time series of commodity prices, as it preserves the overall distribution; here we take the simpler route of dropping the affected rows.
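Both strategies can be sketched on a hypothetical one-column frame with a single gap:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Potato': [27.0, np.nan, 29.0]})
filled = df['Potato'].fillna(df['Potato'].mean())  # mean imputation keeps all rows
dropped = df.dropna()                              # row removal, as used below
```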
data.head()
| Date | Cabbage | Capsicum | Cauliflower | Chicken | Cucumber | eddoe | Egg | Egg_Plant | Garlic | ... | Pakistani_Banana | Potato | Pumpkin | Tomato | Zucchini | Economic_Policy_Uncertainty_Index | Temperature | Rain | Petrol_Prices | USD_PKR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 01/01/2017 | 30 | 64 | 26 | 112 | 48 | 60 | 115 | 28 | 296 | ... | 75 | 27 | 44 | 26 | 27 | 45.36 | 20 | 0 | 66.51 | 104.375 |
| 1 | 02/01/2017 | 36 | 67 | 23 | 112 | 44 | 70 | 116 | 40 | 330 | ... | 80 | 26 | 50 | 28 | 26 | 45.36 | 19 | 0 | 66.51 | 104.750 |
| 2 | 03/01/2017 | 35 | 65 | 24 | 112 | 41 | 74 | 118 | 40 | 330 | ... | 88 | 27 | 60 | 30 | 27 | 45.36 | 17 | 1 | 66.51 | 104.785 |
| 3 | 04/01/2017 | 35 | 68 | 25 | 115 | 44 | 70 | 118 | 52 | 330 | ... | 90 | 28 | 65 | 32 | 25 | 45.36 | 15 | 1 | 66.51 | 104.810 |
| 4 | 05/01/2017 | 33 | 66 | 23 | 97 | 42 | 72 | 121 | 40 | 335 | ... | 85 | 27 | 50 | 29 | 25 | 45.36 | 17 | 0 | 66.51 | 104.700 |
5 rows × 24 columns
data_new.head()
| Cabbage | Capsicum | Cauliflower | Chicken | Cucumber | eddoe | Egg | Egg_Plant | Garlic | Ginger | ... | Pakistani_Banana | Potato | Pumpkin | Tomato | Zucchini | Economic_Policy_Uncertainty_Index | Temperature | Rain | Petrol_Prices | USD_PKR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 30 | 64.0 | 26 | 112 | 48 | 60 | 115 | 28 | 296.0 | 96 | ... | 75.0 | 27 | 44 | 26.0 | 27.0 | 45.36 | 20 | 0 | 66.51 | 104.375 |
| 1 | 36 | 67.0 | 23 | 112 | 44 | 70 | 116 | 40 | 330.0 | 100 | ... | 80.0 | 26 | 50 | 28.0 | 26.0 | 45.36 | 19 | 0 | 66.51 | 104.750 |
| 2 | 35 | 65.0 | 24 | 112 | 41 | 74 | 118 | 40 | 330.0 | 100 | ... | 88.0 | 27 | 60 | 30.0 | 27.0 | 45.36 | 17 | 1 | 66.51 | 104.785 |
| 3 | 35 | 68.0 | 25 | 115 | 44 | 70 | 118 | 52 | 330.0 | 95 | ... | 90.0 | 28 | 65 | 32.0 | 25.0 | 45.36 | 15 | 1 | 66.51 | 104.810 |
| 4 | 33 | 66.0 | 23 | 97 | 42 | 72 | 121 | 40 | 335.0 | 98 | ... | 85.0 | 27 | 50 | 29.0 | 25.0 | 45.36 | 17 | 0 | 66.51 | 104.700 |
5 rows × 23 columns
cleaned_data = data_new.dropna()
cleaned_data.head()
| Cabbage | Capsicum | Cauliflower | Chicken | Cucumber | eddoe | Egg | Egg_Plant | Garlic | Ginger | ... | Pakistani_Banana | Potato | Pumpkin | Tomato | Zucchini | Economic_Policy_Uncertainty_Index | Temperature | Rain | Petrol_Prices | USD_PKR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 30 | 64.0 | 26 | 112 | 48 | 60 | 115 | 28 | 296.0 | 96 | ... | 75.0 | 27 | 44 | 26.0 | 27.0 | 45.36 | 20 | 0 | 66.51 | 104.375 |
| 1 | 36 | 67.0 | 23 | 112 | 44 | 70 | 116 | 40 | 330.0 | 100 | ... | 80.0 | 26 | 50 | 28.0 | 26.0 | 45.36 | 19 | 0 | 66.51 | 104.750 |
| 2 | 35 | 65.0 | 24 | 112 | 41 | 74 | 118 | 40 | 330.0 | 100 | ... | 88.0 | 27 | 60 | 30.0 | 27.0 | 45.36 | 17 | 1 | 66.51 | 104.785 |
| 3 | 35 | 68.0 | 25 | 115 | 44 | 70 | 118 | 52 | 330.0 | 95 | ... | 90.0 | 28 | 65 | 32.0 | 25.0 | 45.36 | 15 | 1 | 66.51 | 104.810 |
| 4 | 33 | 66.0 | 23 | 97 | 42 | 72 | 121 | 40 | 335.0 | 98 | ... | 85.0 | 27 | 50 | 29.0 | 25.0 | 45.36 | 17 | 0 | 66.51 | 104.700 |
5 rows × 23 columns
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
# Assuming 'cleaned_data' is a pandas DataFrame with the target variable 'Potato'
X = cleaned_data.drop('Potato', axis=1) # Features
y = cleaned_data['Potato'] # Target variable
# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Reduce the dimensionality to 2 dimensions using PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_pca, y, test_size=0.2, random_state=42)
# Fit the SVM model
model = SVR(kernel='rbf')
model.fit(X_train, y_train)
# Create a grid for plotting decision boundaries in the PCA space
h = .02 # Grid step size
x_min, x_max = X_pca[:, 0].min() - 1, X_pca[:, 0].max() + 1
y_min, y_max = X_pca[:, 1].min() - 1, X_pca[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
# Predict using the grid
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
# Plot the contour and training points
plt.contourf(xx, yy, Z, cmap=plt.cm.coolwarm, alpha=0.8)
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, edgecolors='k', cmap=plt.cm.coolwarm)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('SVM Decision Boundary with RBF Kernel (PCA-transformed space)')
plt.colorbar()
plt.show()
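The StandardScaler-then-PCA step above compresses the 22 standardized features into two components before fitting the SVR. How PCA concentrates correlated variance into the first component can be seen on synthetic data (a hypothetical toy matrix, not the commodity features):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 1))
# Two nearly collinear features plus one independent one:
# most of the variance lies along a single direction
X_toy = np.hstack([base,
                   2 * base + rng.normal(scale=0.1, size=(100, 1)),
                   rng.normal(size=(100, 1))])
X_std = StandardScaler().fit_transform(X_toy)
pca = PCA(n_components=2)
X2 = pca.fit_transform(X_std)
```

With the first two columns almost perfectly correlated, the first principal component captures roughly two-thirds of the total standardized variance.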
The correlation matrix reveals how each feature is related to the price of potatoes. Notably, features such as "Egg," "Petrol Prices," and "Onion" show a strong positive correlation with "Potato" prices, suggesting that changes in these variables are associated with changes in potato prices. Conversely, variables like "Laddy_Finger" and "Rain" have a negative correlation, although "Rain" has a relatively weak negative correlation with "Potato" prices.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# X already contains only numeric feature columns ('Date' was dropped when building data_new),
# so no column needs to be removed here
X_numeric = X.copy()
# We'll attempt to convert all columns to numeric values, in case there are any other non-numeric columns
X_numeric = X_numeric.apply(pd.to_numeric, errors='coerce')
# Filling any NaNs that were introduced by non-numeric values with column means
X_numeric = X_numeric.fillna(X_numeric.mean())
# Now we can scale and apply PCA
X_scaled = StandardScaler().fit_transform(X_numeric)
# PCA for feature engineering
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(X_scaled)
principalDf = pd.DataFrame(data=principalComponents, columns=['PC1', 'PC2'])
# Concatenating the target variable to plot it
finalDf = pd.concat([principalDf, y.reset_index(drop=True)], axis=1)
# Visualizing the PCA-transformed features with the target variable
plt.figure(figsize=(8, 6))
scatter = plt.scatter(finalDf['PC1'], finalDf['PC2'], c=finalDf['Potato'], cmap='viridis')
plt.colorbar(scatter)
plt.title('PCA of Agricultural Dataset')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.grid(True)
plt.show()
Before we proceed with the correlation matrix and feature selection, we'll prepare the data by ensuring all variables are in the correct format since some columns appear to be read as objects due to inconsistent data formatting. Let's convert these columns to numerical values where necessary and proceed with the analysis.
After handling the missing values, we'll compute the correlation matrix to understand the relationship between "Potato" prices and other features. Following that, we'll conduct iterative feature selection to identify the most relevant features for predicting potato prices and then apply Random Forest and SVM models to the dataset. Let's start by addressing the missing values.
Next, we'll perform iterative feature selection to identify the most relevant features for predicting potato prices. This process involves fitting a model multiple times, each time with a different subset of features, and selecting the subset that results in the best model performance. We'll use this approach with both Random Forest and Support Vector Machine (SVM) models to determine the optimal features for each.
For Random Forest, feature importance provided by the model can also guide the selection process. For SVM, feature selection will be based on model performance metrics since SVM does not inherently provide feature importance scores.
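The RFECV procedure described above can be sketched on synthetic data before committing to the slow full-dataset runs below (a hypothetical toy regression, not the commodity dataset):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

# Toy problem: 5 informative features hidden among 8
X_toy, y_toy = make_regression(n_samples=120, n_features=8, n_informative=5,
                               noise=0.1, random_state=0)
# RFECV drops the weakest feature each round and keeps the subset
# with the best cross-validated score
selector = RFECV(estimator=LinearRegression(), step=1, cv=3,
                 scoring='neg_mean_squared_error')
selector.fit(X_toy, y_toy)
# selector.support_ is a boolean mask over the columns;
# selector.n_features_ is the size of the chosen subset
```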
# 'cleaned_data' is our DataFrame and 'Potato' is the target variable
X = cleaned_data.drop('Potato', axis=1) # Features
y = cleaned_data['Potato'] # Target variable
# Initialize the Random Forest model
rf_model = RandomForestRegressor()
# Recursive Feature Elimination with Cross-Validation (RFECV) for feature selection
from sklearn.feature_selection import RFECV
rfecv_rf = RFECV(estimator=rf_model, step=1, cv=5, scoring='neg_mean_squared_error')
rfecv_rf.fit(X, y)
# Identifying the most important features based on Random Forest
important_features_rf = X.columns[rfecv_rf.support_]
print("Optimal number of features for Random Forest: ", rfecv_rf.n_features_)
print("Most important features for Random Forest: ", important_features_rf)
(Execution interrupted with KeyboardInterrupt: RFECV with a Random Forest over the full dataset is computationally expensive and the run was stopped manually.)
from sklearn.svm import SVR
# Initialize the SVM model
svm_model = SVR(kernel='linear') # Linear kernel is chosen for simplicity and interpretability
# Recursive Feature Elimination with Cross-Validation (RFECV) for feature selection
from sklearn.feature_selection import RFECV
rfecv_svm = RFECV(estimator=svm_model, step=1, cv=5, scoring='neg_mean_squared_error')
rfecv_svm.fit(X, y)
# Identifying the most important features based on SVM performance
important_features_svm = X.columns[rfecv_svm.support_]
print("Optimal number of features for SVM: ", rfecv_svm.n_features_)
print("Most important features for SVM: ", important_features_svm)
KeyboardInterrupt — the RFECV run with the linear-kernel SVR was likewise interrupted manually; each elimination step refits the SVR, which is too slow at this scale. (Traceback output truncated.)
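Both RFECV runs above had to be interrupted before completing. A much cheaper alternative is to rank features once by a Random Forest's `feature_importances_` and keep those above a threshold via `SelectFromModel`. This is a hedged sketch on synthetic data (the notebook's `cleaned_data` is not reloaded here), not a drop-in replacement for RFECV's cross-validated search.

```python
# Cheaper feature-selection sketch: keep features whose Random Forest
# importance is at or above the median importance. Synthetic data stands
# in for the notebook's 'cleaned_data'.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

X_demo, y_demo = make_regression(n_samples=200, n_features=10,
                                 n_informative=3, random_state=42)

rf = RandomForestRegressor(n_estimators=50, random_state=42)
selector = SelectFromModel(rf, threshold="median")
selector.fit(X_demo, y_demo)

mask = selector.get_support()  # boolean mask over the 10 features
print("Selected feature indices:", np.flatnonzero(mask))
```

Unlike RFECV, this fits the forest only once, so it scales far better; the trade-off is that the selection is not validated across folds.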
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
# 'cleaned_data' is our DataFrame and 'Potato' is the target variable
X = cleaned_data.drop('Potato', axis=1) # Features
y = cleaned_data['Potato'] # Target variable
# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize the models
models = {
"Random Forest Regressor": RandomForestRegressor(),
"SVM Regressor": SVR(),
"Linear Regression": LinearRegression()
}
# Initialize an empty dictionary to store the metrics
performance_metrics = {}
# Loop through each model to train, predict, and calculate metrics
for model_name, model in models.items():
model.fit(X_train, y_train) # Train model
y_train_pred = model.predict(X_train) # Predictions on training set
y_test_pred = model.predict(X_test) # Predictions on testing set
# Calculate metrics
metrics = {
'Training MAE': mean_absolute_error(y_train, y_train_pred),
'Training MSE': mean_squared_error(y_train, y_train_pred),
'Training RMSE': np.sqrt(mean_squared_error(y_train, y_train_pred)),
'Training R^2': r2_score(y_train, y_train_pred),
'Testing MAE': mean_absolute_error(y_test, y_test_pred),
'Testing MSE': mean_squared_error(y_test, y_test_pred),
'Testing RMSE': np.sqrt(mean_squared_error(y_test, y_test_pred)),
'Testing R^2': r2_score(y_test, y_test_pred),
}
performance_metrics[model_name] = metrics
# Convert the metrics dictionary to a DataFrame for a nicer display
performance_df = pd.DataFrame(performance_metrics).transpose()
# Show the DataFrame
print(performance_df)
| Training MAE | Training MSE | Training RMSE | Training R^2 | Testing MAE | Testing MSE | Testing RMSE | Testing R^2 | |
|---|---|---|---|---|---|---|---|---|
| Random Forest Regressor | 0.512033 | 0.661401 | 0.813266 | 0.997104 | 1.473411 | 7.798689 | 2.792613 | 0.961175 |
| SVM Regressor | 3.838637 | 27.800915 | 5.272657 | 0.878253 | 3.710615 | 25.418087 | 5.041635 | 0.873459 |
| Linear Regression | 4.884127 | 40.634700 | 6.374535 | 0.822051 | 4.544789 | 38.049863 | 6.168457 | 0.810574 |
# Function for styling the DataFrame: highlight the best value in each column
def highlight_best(data, color='limegreen'):
    '''
    Highlight the best cell per column in the specified color:
    lower values for error metrics (MAE, MSE, RMSE), higher values for R^2.
    With style.apply(..., axis=None) the whole DataFrame is passed in,
    so no separate single-column branch is needed.
    '''
    attr = f'background-color: {color}'
    output = pd.DataFrame('', index=data.index, columns=data.columns)
    for col in data.columns:
        if 'R^2' in col:
            output[col] = np.where(data[col] == data[col].max(), attr, '')
        else:
            output[col] = np.where(data[col] == data[col].min(), attr, '')
    return output
# Applying the adjusted styling
styled_df = performance_df.style.apply(highlight_best, axis=None).format("{:.3f}")
# Display the styled DataFrame
styled_df
| Training MAE | Training MSE | Training RMSE | Training R^2 | Testing MAE | Testing MSE | Testing RMSE | Testing R^2 | |
|---|---|---|---|---|---|---|---|---|
| Random Forest Regressor | 0.512 | 0.661 | 0.813 | 0.997 | 1.473 | 7.799 | 2.793 | 0.961 |
| SVM Regressor | 3.839 | 27.801 | 5.273 | 0.878 | 3.711 | 25.418 | 5.042 | 0.873 |
| Linear Regression | 4.884 | 40.635 | 6.375 | 0.822 | 4.545 | 38.050 | 6.168 | 0.811 |
The results show the performance of three regression models (Random Forest, SVM, and Linear Regression) on predicting potato prices, based on their R-squared (R^2) values for both the training and testing sets. R^2 is a statistical measure of the proportion of variance in the dependent variable that is explained by the independent variables; it ranges from 0 to 1, with higher values indicating a better fit.
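As a quick, self-contained illustration of the metric (hypothetical numbers, not tied to the dataset above), R^2 = 1 - SS_res/SS_tot can be checked against scikit-learn's `r2_score`:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])   # hypothetical observed prices
y_pred = np.array([2.8, 5.1, 7.2, 8.9])   # hypothetical model predictions

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(round(r2_manual, 3), round(r2_score(y_true, y_pred), 3))
```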
These results provide detailed insight into each model's performance on our agricultural price prediction task. Let's interpret them in light of the research objectives outlined at the beginning of this study, which focus on enhancing the accuracy of agricultural price predictions through machine learning techniques.
Our primary research objective is to enhance the accuracy of agricultural price predictions, which requires models that can effectively capture and predict complex relationships between various influencing factors. The results demonstrate that the Random Forest Regressor aligns best with this objective, due to its superior ability to model nonlinear relationships and interactions among features without requiring explicit feature transformation or selection. This is critical in the context of agricultural data, where many variables interact in complex ways to influence prices.
The robust performance of the Random Forest model, particularly its ability to generalize well to unseen data as evidenced by its testing set performance, suggests it is a promising approach for developing a comprehensive pricing prediction model. Its high accuracy and reliability could significantly contribute to stabilizing market prices and aiding stakeholders across the agricultural supply chain in making informed decisions, thereby enhancing economic security and sustainability in the sector, as outlined in our research goals.
Given the strong performance of the Random Forest model, future work could focus on further tuning its parameters, exploring feature engineering to capture temporal and spatial patterns, and incorporating additional data sources that may provide deeper insights into price fluctuations. Additionally, considering ensemble methods or advanced machine learning techniques like deep learning could offer further improvements in predictive accuracy, aligning with the goal of leveraging cutting-edge ML algorithms to address the challenges faced by the agricultural sector.
Hyperparameter tuning is a crucial step in improving a machine learning model's performance. Since the Random Forest Regressor shows the best performance among the models we've tested, let's focus on tuning it. We'll use GridSearchCV from scikit-learn, which searches over specified parameter values for an estimator.
Here is how we implement model tuning for the Random Forest Regressor:
# Define the parameter grid to search over
param_grid = {
'n_estimators': [100, 200, 300], # Number of trees in the forest
'max_depth': [None, 10, 20, 30], # Maximum depth of the tree
'min_samples_split': [2, 5, 10], # Minimum number of samples required to split an internal node
'min_samples_leaf': [1, 2, 4] # Minimum number of samples required to be at a leaf node
}
# Initialize the Random Forest model
rf = RandomForestRegressor(random_state=42)
# Initialize GridSearchCV with the parameter grid
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, n_jobs=-1, verbose=2, scoring='neg_mean_squared_error')
# Fit GridSearchCV to the training data
grid_search.fit(X_train, y_train)
# Print the best parameters and the best cross-validated score
# (note: with scoring='neg_mean_squared_error', the score is the negated MSE)
print("Best parameters:", grid_search.best_params_)
print("Best score (MSE):", grid_search.best_score_)
# Use the best estimator for further predictions
best_rf_model = grid_search.best_estimator_
# We can now assess the performance of 'best_rf_model' on our test set, as before
y_test_pred = best_rf_model.predict(X_test)
# Calculate and print performance metrics for the test set
test_mae = mean_absolute_error(y_test, y_test_pred)
test_mse = mean_squared_error(y_test, y_test_pred)
test_rmse = np.sqrt(test_mse)
test_r2 = r2_score(y_test, y_test_pred)
Fitting 5 folds for each of 108 candidates, totalling 540 fits
Best parameters: {'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 300}
Best score (MSE): -5.434561691626408
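Because GridSearchCV was given `scoring='neg_mean_squared_error'`, the reported best score is the negated MSE. To express it on the original price scale, negate it and take the square root; a small sketch using the value printed above:

```python
import numpy as np

best_score = -5.434561691626408   # grid_search.best_score_ from the run above
cv_mse = -best_score              # cross-validated MSE
cv_rmse = np.sqrt(cv_mse)         # cross-validated RMSE, in the target's units

print(round(cv_mse, 3), round(cv_rmse, 3))
```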
# Calculate performance metrics for the training set with the tuned model
y_train_pred_tuned = best_rf_model.predict(X_train)
# Calculating metrics for the training set
train_mae_tuned = mean_absolute_error(y_train, y_train_pred_tuned)
train_mse_tuned = mean_squared_error(y_train, y_train_pred_tuned)
train_rmse_tuned = np.sqrt(train_mse_tuned)
train_r2_tuned = r2_score(y_train, y_train_pred_tuned)
# Adding tuned model metrics to the performance metrics dictionary
performance_metrics['Tuned Random Forest Regressor'] = {
'Training MAE': train_mae_tuned,
'Training MSE': train_mse_tuned,
'Training RMSE': train_rmse_tuned,
'Training R^2': train_r2_tuned,
'Testing MAE': test_mae,
'Testing MSE': test_mse,
'Testing RMSE': test_rmse,
'Testing R^2': test_r2,
}
# Convert the updated metrics dictionary to a DataFrame
performance_df = pd.DataFrame(performance_metrics).transpose()
# Define the styling function (applied with axis=None, so 'data' is the full DataFrame)
def highlight_best(data, color='limegreen'):
    attr = f'background-color: {color}'
    output = pd.DataFrame('', index=data.index, columns=data.columns)
    for col in data.columns:
        if 'R^2' in col:
            output[col] = np.where(data[col] == data[col].max(), attr, '')
        else:
            output[col] = np.where(data[col] == data[col].min(), attr, '')
    return output
# Applying the adjusted styling
styled_df = performance_df.style.apply(highlight_best, axis=None).format("{:.3f}")
# Display the styled DataFrame
styled_df
| Training MAE | Training MSE | Training RMSE | Training R^2 | Testing MAE | Testing MSE | Testing RMSE | Testing R^2 | |
|---|---|---|---|---|---|---|---|---|
| Random Forest Regressor | 0.512 | 0.661 | 0.813 | 0.997 | 1.473 | 7.799 | 2.793 | 0.961 |
| SVM Regressor | 3.839 | 27.801 | 5.273 | 0.878 | 3.711 | 25.418 | 5.042 | 0.873 |
| Linear Regression | 4.884 | 40.635 | 6.375 | 0.822 | 4.545 | 38.050 | 6.168 | 0.811 |
| Tuned Random Forest Regressor | 0.505 | 0.621 | 0.788 | 0.997 | 1.444 | 7.466 | 2.732 | 0.963 |
import matplotlib.pyplot as plt
import numpy as np
# Define metrics for Random Forest Regressor before tuning
rf_testing_mae = 1.473
rf_testing_mse = 7.799
rf_testing_rmse = 2.793
rf_testing_r2 = 0.961
# Define metrics for Tuned Random Forest Regressor
tuned_rf_testing_mae = 1.444
tuned_rf_testing_mse = 7.466
tuned_rf_testing_rmse = 2.732
tuned_rf_testing_r2 = 0.963
# Labels for the metrics
metrics = ['Testing MAE', 'Testing MSE', 'Testing RMSE', 'Testing R^2']
# Values for each model
rf_values = [rf_testing_mae, rf_testing_mse, rf_testing_rmse, rf_testing_r2]
tuned_rf_values = [tuned_rf_testing_mae, tuned_rf_testing_mse, tuned_rf_testing_rmse, tuned_rf_testing_r2]
# Setting the positions and width for the bars
pos = np.arange(len(metrics))
width = 0.35
# Plotting the bars
fig, ax = plt.subplots(figsize=(10, 6))
bar1 = ax.bar(pos - width/2, rf_values, width, label='Random Forest Regressor', hatch='//', edgecolor='black')
bar2 = ax.bar(pos + width/2, tuned_rf_values, width, label='Tuned Random Forest Regressor', hatch='\\\\', edgecolor='black')
# Adding the labels, title, and legend
ax.set_ylabel('Scores')
ax.set_title('Comparison of Testing Results: Random Forest vs. Tuned Random Forest')
ax.set_xticks(pos)
ax.set_xticklabels(metrics)
ax.legend()
# Adding the values on top of the bars for better visibility in print
def add_values(bars):
for bar in bars:
height = bar.get_height()
ax.annotate(f'{height:.3f}',
xy=(bar.get_x() + bar.get_width() / 2, height),
xytext=(0, 3), # 3 points vertical offset
textcoords='offset points',
ha='center', va='bottom')
add_values(bar1)
add_values(bar2)
plt.show()
The Tuned Random Forest Regressor results offer a more refined insight into the potential of machine learning models to meet our research objectives. Here's a detailed interpretation of these results, and of how they align with our goal of enhancing agricultural price predictions:
Our primary objective is to create accurate and reliable predictive models for agricultural prices, a task fraught with challenges due to the complex interplay of factors affecting these prices. The improvements seen with the Tuned Random Forest Regressor, particularly in its testing set performance, align closely with this goal. By marginally increasing the model's accuracy and maintaining its high R^2 score, we're moving closer to developing a tool that can offer substantial benefits to stakeholders across the agricultural sector, from farmers to policymakers.
The slight improvements in predictive accuracy with the tuned model underscore the potential of advanced machine learning techniques to navigate the complexities of agricultural markets. This aligns with the broader objective of utilizing cutting-edge technology to enhance economic security and sustainability within agriculture.
The Random Forest Regressor, particularly in its tuned form, stands out as the most effective model among those tested. Its superior performance across both training and testing sets underscores its capability to address the intricacies of price prediction in the agricultural domain.
Moving forward, further exploration and optimization could include additional hyperparameter tuning, feature engineering to capture temporal and spatial patterns, and incorporating richer data sources.
In conclusion, the results post-tuning provide a promising outlook for leveraging machine learning to fulfill our research objectives, offering a blend of high accuracy and robustness essential for predictive models in the agricultural sector.
# Initialize the dictionary with the initial set of models
models = {
"Random Forest Regressor": RandomForestRegressor(),
"SVM Regressor": SVR(),
"Linear Regression": LinearRegression()
}
# Adding new models: Gradient Boosting, AdaBoost, SLNN, MLNN
new_models = {
"Gradient Boosting Regressor": GradientBoostingRegressor(),
"AdaBoost Regressor": AdaBoostRegressor(),
"SLNN Regressor": MLPRegressor(hidden_layer_sizes=(64,), max_iter=500),
"MLNN Regressor": MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500),
}
# Update the original models dictionary with the new models
models.update(new_models)
# Initialize an empty dictionary to store the metrics
performance_metrics = {}
# Loop through each model to train, predict, and calculate metrics
for model_name, model in models.items():
model.fit(X_train, y_train) # Train model
y_train_pred = model.predict(X_train) # Predictions on training set
y_test_pred = model.predict(X_test) # Predictions on testing set
# Calculate metrics
metrics = {
'Training MAE': mean_absolute_error(y_train, y_train_pred),
'Training MSE': mean_squared_error(y_train, y_train_pred),
'Training RMSE': np.sqrt(mean_squared_error(y_train, y_train_pred)),
'Training R^2': r2_score(y_train, y_train_pred),
'Testing MAE': mean_absolute_error(y_test, y_test_pred),
'Testing MSE': mean_squared_error(y_test, y_test_pred),
'Testing RMSE': np.sqrt(mean_squared_error(y_test, y_test_pred)),
'Testing R^2': r2_score(y_test, y_test_pred),
}
performance_metrics[model_name] = metrics
# Convert the metrics dictionary to a DataFrame for a nicer display
performance_df2 = pd.DataFrame(performance_metrics).transpose()
# Define the styling function
def highlight_best(data, color='limegreen'):
attr = f'background-color: {color}'
output = pd.DataFrame('', index=data.index, columns=data.columns)
for col in data.columns:
if 'R^2' in col:
output[col] = np.where(data[col] == data[col].max(), attr, '')
else:
output[col] = np.where(data[col] == data[col].min(), attr, '')
return output
# Applying the adjusted styling
styled_df2 = performance_df2.style.apply(highlight_best, axis=None).format("{:.3f}")
# Display the styled DataFrame
styled_df2
| Training MAE | Training MSE | Training RMSE | Training R^2 | Testing MAE | Testing MSE | Testing RMSE | Testing R^2 | |
|---|---|---|---|---|---|---|---|---|
| Random Forest Regressor | 0.511 | 0.666 | 0.816 | 0.997 | 1.468 | 7.589 | 2.755 | 0.962 |
| SVM Regressor | 3.839 | 27.801 | 5.273 | 0.878 | 3.711 | 25.418 | 5.042 | 0.873 |
| Linear Regression | 4.884 | 40.635 | 6.375 | 0.822 | 4.545 | 38.050 | 6.168 | 0.811 |
| Gradient Boosting Regressor | 1.645 | 4.627 | 2.151 | 0.980 | 2.089 | 9.307 | 3.051 | 0.954 |
| AdaBoost Regressor | 3.641 | 18.130 | 4.258 | 0.921 | 3.769 | 19.984 | 4.470 | 0.901 |
| SLNN Regressor | 2.382 | 9.252 | 3.042 | 0.959 | 2.550 | 11.952 | 3.457 | 0.940 |
| MLNN Regressor | 1.904 | 6.131 | 2.476 | 0.973 | 2.320 | 9.839 | 3.137 | 0.951 |
# Initialize the dictionary with the initial set of models, including the Tuned Random Forest Regressor
tuned_parameters = grid_search.best_params_  # best parameters from the GridSearchCV run above
models = {
    "Tuned Random Forest Regressor": RandomForestRegressor(**tuned_parameters),
"Random Forest Regressor": RandomForestRegressor(),
"SVM Regressor": SVR(),
"Linear Regression": LinearRegression()
}
# Adding new models: Gradient Boosting, AdaBoost, SLNN, MLNN
new_models = {
"Gradient Boosting Regressor": GradientBoostingRegressor(),
"AdaBoost Regressor": AdaBoostRegressor(),
"SLNN Regressor": MLPRegressor(hidden_layer_sizes=(64,), max_iter=500),
"MLNN Regressor": MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500),
}
# Update the original models dictionary with the new models
models.update(new_models)
# Initialize an empty dictionary to store the metrics
performance_metrics = {}
# Loop through each model to train, predict, and calculate metrics
for model_name, model in models.items():
model.fit(X_train, y_train) # Train model
y_train_pred = model.predict(X_train) # Predictions on training set
y_test_pred = model.predict(X_test) # Predictions on testing set
# Calculate metrics
metrics = {
'Testing MAE': mean_absolute_error(y_test, y_test_pred),
'Testing MSE': mean_squared_error(y_test, y_test_pred),
'Testing RMSE': np.sqrt(mean_squared_error(y_test, y_test_pred)),
'Testing R^2': r2_score(y_test, y_test_pred),
}
performance_metrics[model_name] = metrics
# Convert the metrics dictionary to a DataFrame for a nicer display
performance_df2 = pd.DataFrame(performance_metrics).transpose()
# Define the styling function
def highlight_best(data, color='limegreen'):
attr = f'background-color: {color}'
output = pd.DataFrame('', index=data.index, columns=data.columns)
for col in data.columns:
if 'R^2' in col:
output[col] = np.where(data[col] == data[col].max(), attr, '')
else:
output[col] = np.where(data[col] == data[col].min(), attr, '')
return output
# Applying the adjusted styling
styled_df2 = performance_df2.style.apply(highlight_best, axis=None).format("{:.3f}")
# Display the styled DataFrame
styled_df2
| Testing MAE | Testing MSE | Testing RMSE | Testing R^2 | |
|---|---|---|---|---|
| Tuned Random Forest Regressor | 1.447 | 7.297 | 2.701 | 0.964 |
| Random Forest Regressor | 1.480 | 7.727 | 2.780 | 0.962 |
| SVM Regressor | 3.711 | 25.418 | 5.042 | 0.873 |
| Linear Regression | 4.545 | 38.050 | 6.168 | 0.811 |
| Gradient Boosting Regressor | 2.088 | 9.295 | 3.049 | 0.954 |
| AdaBoost Regressor | 3.826 | 21.444 | 4.631 | 0.893 |
| SLNN Regressor | 3.101 | 18.407 | 4.290 | 0.908 |
| MLNN Regressor | 2.829 | 15.956 | 3.994 | 0.921 |
The addition of four new models—Gradient Boosting Regressor, AdaBoost Regressor, Single-Layer Neural Network (SLNN) Regressor, and Multi-Layer Neural Network (MLNN) Regressor—alongside the traditional and tuned Random Forest models, provides a comprehensive comparison of commonly used regression models in the context of agricultural price prediction. The aim was to explore a range of models to find the best performer based on standard metrics: Testing Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared (R^2).
Tuned Random Forest Regressor emerges as the top performer with the highest R^2 score (0.964) and the lowest error metrics (MAE: 1.447, MSE: 7.297, RMSE: 2.701), indicating its superior ability to predict agricultural prices accurately. The tuning process refined its parameters to optimize performance.
Gradient Boosting Regressor shows impressive results, with a high R^2 score (0.954) and relatively low error metrics (MAE: 2.088, MSE: 9.295, RMSE: 3.049), making it a strong candidate for this task. Its performance underscores the effectiveness of ensemble learning methods that combine multiple weak predictors to form a robust model.
AdaBoost Regressor and SLNN Regressor offer moderate performance with R^2 scores of 0.893 and 0.908, respectively. While not matching the top performers, their results are respectable and highlight the diversity in model capabilities. AdaBoost, another ensemble method, and SLNN, a simple neural network, demonstrate their potential utility in regression tasks with complex datasets.
MLNN Regressor achieves a higher R^2 score (0.921) than SLNN, suggesting that adding more layers can capture more complex patterns in the data, though at the cost of increased computational complexity.
SVM Regressor and Linear Regression lag behind the other models, with R^2 scores of 0.873 and 0.811, respectively. Their lower performance highlights the challenges these models face with the complex nonlinear relationships and high dimensionality typical of agricultural price data.
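One plausible reason the SVM lags is that SVR is sensitive to feature scale, and the raw features here mix very different units (prices, indices, exchange rates). Below is a hedged sketch, on synthetic data with deliberately mismatched column scales rather than the dataset above, of wrapping SVR in a scaling pipeline:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in: columns rescaled to very different ranges,
# mimicking mixed units in the real feature set
X, y = make_regression(n_samples=300, n_features=8, noise=5.0, random_state=42)
X = X * np.array([1.0, 10.0, 100.0, 1000.0, 1.0, 10.0, 100.0, 1000.0])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scaling inside the pipeline: the scaler's statistics are fit on the
# training fold only, so no test-set information leaks into training
scaled_svr = make_pipeline(StandardScaler(), SVR(kernel='rbf', C=100))
scaled_svr.fit(X_train, y_train)
test_score = scaled_svr.score(X_test, y_test)
print(round(test_score, 3))
```

Fitting the scaler inside the pipeline also makes the estimator safe to use directly in cross-validation or GridSearchCV.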
The exploration of these models was driven by the need to benchmark the performance of various regression techniques against the agricultural price prediction task. It revealed that ensemble methods and neural networks, especially when fine-tuned, tend to offer superior performance over traditional linear models and SVM in handling the complexities of agricultural price data. The Tuned Random Forest Regressor, in particular, stands out for its exceptional accuracy, underscoring the value of model tuning in optimizing machine learning solutions for real-world applications. This comprehensive model comparison helps in identifying the most effective techniques for predicting agricultural prices, aligning with the research objective to leverage advanced machine learning for enhancing decision-making in the agricultural sector.
The results from testing various machine learning models on agricultural price data provide insightful answers to our research questions. Let's delve into each question based on the findings:
The exploration of multiple machine learning models, including Random Forest, Gradient Boosting, AdaBoost, SLNN, and MLNN, alongside traditional models like SVM and Linear Regression, has demonstrated that machine learning can significantly enhance the accuracy of agricultural price predictions. The Tuned Random Forest Regressor, in particular, showcased the highest accuracy, as evidenced by its superior R^2 score and lower error metrics.
While the initial model comparisons primarily focused on the performance of different machine learning algorithms, incorporating socioeconomic variables into these models is crucial for capturing the broader context that influences agricultural prices. Variables such as economic policy uncertainty, fuel prices, and exchange rates can significantly impact agricultural costs and, subsequently, prices.
The relationships and interconnections between different agricultural commodities can have profound impacts on price projections. For example, the price of one commodity might influence another due to substitution effects or shared supply chain dependencies.
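These cross-commodity relationships can be screened with a simple correlation matrix over the price columns. Since `Final Data.csv` is not reloaded here, this is a hedged sketch on synthetic daily prices in which the hypothetical `Potato` and `Tomato` series share a common trend while `Garlic` follows its own random walk:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 365
trend = np.cumsum(rng.normal(0, 1, n))  # shared market-wide trend

# Hypothetical daily prices: Potato and Tomato co-move, Garlic does not
prices = pd.DataFrame({
    'Potato': 30 + trend + rng.normal(0, 1, n),
    'Tomato': 28 + trend + rng.normal(0, 2, n),
    'Garlic': 300 + np.cumsum(rng.normal(0, 3, n)),
})

corr = prices.corr()
print(corr.round(2))
```

On the real data, calling `.corr()` on the commodity price columns (or computing rolling correlations) would reveal which commodities co-move, a candidate signal for substitution effects or shared supply-chain dependencies.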
The findings from testing various machine learning models provide a solid foundation for addressing the outlined research questions. They underline the potential of machine learning to enhance the accuracy of agricultural price predictions by effectively capturing complex patterns and relationships within the data. Incorporating socioeconomic variables and understanding the interconnections between different agricultural commodities are essential steps toward developing comprehensive and predictive models. These insights align with the goal of leveraging advanced machine learning techniques to aid decision-making processes in agriculture, ensuring economic security and sustainability.